CAGEF_services_slide.png

Advanced Graphics and Data Visualization in R

Lecture 02: ggplot2 and choosing the right visualization for your audience

0.1.0 An overview of Advanced Graphics and Data Visualization in R

"Advanced Graphics and Data Visualization in R" is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This CSB1021 was developed to enhance the skills of students with basic backgrounds in R by focusing on available philosophies, methods, and packages for plotting scientific data. While the datasets and examples used in this course will be centred on SARS-CoV2 epidemiological and genomic data, the lessons learned herein will be broadly applicable.

This lesson is the second in a 6-part series. The aim for the end of this series is for students to recognize how to import, format, and display data based on their intended message and audience. The format and style of these visualizations will help to identify and convey the key message(s) from their experimental data.

The structure of the class is a code-along style in Jupyter notebooks. At the start of each lecture, skeleton versions of the lecture will be provided for use on the University of Toronto Jupyter Hub so students can program along with the instructor.


0.2.0 Lecture objectives

This week we'll delve into some of the helpful visualizations available through the ggplot2 package. First we'll go over deciding which visualizations are best for what we want to accomplish, and then explore some of these in more detail.

At the end of this lecture you will have covered the following topics:

  1. The grammar of graphics.
  2. Scatterplots and their variants.
  3. Barplots and their variants.
  4. Density plots and their variants.
  5. Categorical plots and their variants.

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn R
Red boxes: A comprehension question which may or may not involve a coding cell. You will usually find these at the end of a section.

0.4.0 Data used in this lesson

Today's datasets will focus on some more Ontario public health unit data. We'll deep dive a bit more into the demographics of SARS-CoV-2 infection, hospitalization, and survival.

0.4.1 Dataset 1: PHU_population_information.csv

Retrieved from the open data sets available at Public Health Ontario, this data provides summary statistic information for all PHUs in Ontario including the population size for each PHU.

0.4.2 Dataset 2: Ontario_daily_change_in_cases_by_phu_long.tsv

This data originated from the Ontario Provincial government but was tidied during lecture 01 to reflect a long-format set of data compatible with our needs.

0.4.3 Dataset 3: Ontario_age_sex_COVID-19.csv

Retrieved from the open data sets available at Public Health Ontario, this dataset focuses on summarizing PHU data across age and sex groups.


0.5.0 Packages used in this lesson

repr - a package useful for altering some of the attributes of objects related to the R kernel.

tidyverse - a collection of packages including dplyr, tidyr, stringr, forcats and ggplot2.

viridis - helps to create colour-blind-friendly palettes for our data visualizations.

lubridate and zoo - helper packages used for working with date formats in R.

ggbeeswarm, ggridges and GGally - new packages we need to install. They'll help us generate some nice graphs.


1.0.0 Why is data visualization important?

How do we extract meaningful insights from our data? If you have previously explored Anscombe's quartet you'll know that, as scientists and lay people, we can sometimes be obsessed with summary statistics - mean, median, mode, standard deviation. While these values are a helpful way to quickly assess a population, they can be flawed to the point of deception. Instead, we should temper our investigations by visualizing our data. A deeper understanding of your data trends and potential models comes from dissecting attributes of your data, which can jump out more easily through visualization.

Equally important is the ability to convey our findings to others. The right visualizations, whether simplistic or complex, should effectively communicate our key message.


1.1.0 The grammar of graphics

One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (2005). The idea of grammar can be summarized as follows:

The grammar of graphics facilitates the concise description of the components of any graphic. Hadley Wickham of tidyverse fame has proposed a variant on this concept - the layered grammar of graphics framework. By following a layered approach of defined components, it becomes easy to build a visualization.

MajorComponentsGrammarGraphics.png The Major Components of the Grammar of Graphics by Dipanjan Sarkar

We can break down the above pyramid by the base components, building from the bottom upwards.

  1. Data: your visualization always starts here. What are the dimensions you want to visualize? What aspect of your data are you trying to convey?
  2. Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, fill, etc.?
  3. Scale: do you need to scale/transform any values to fit your data within a range? This includes layers that map between the data and the aesthetics.
  4. Geometric objects: how will you display your data within your visualization? Which geom_* will you use?
  5. Statistics: are there additional summary statistics that should be included in the visualization? Some examples include central tendency, spread, confidence intervals, standard error, etc.
  6. Facets: will generating subplots of the data add a dimension to our visualization that would otherwise be lost?
  7. Coordinate system: will your visualization follow a classic Cartesian, semi-log, polar, etc. coordinate system?

1.2.0 Cumulative caseload data by PHU

Let's look at a summarized data set this week with cumulative case counts across all PHUs in Ontario. What is nice about this data set is that it also carries some population data with it to give us a sense of proportions. Steps we'll take in working with this data:

  1. Open up the data with read_csv()
  2. Adjust the column names
  3. Drop the "recent" data but keep the cumulative data

Let's open that up with our friend read_csv()


Looking at the data itself, we want to perform the following wrangling actions:

  1. Rename the variable names to lower case.
  2. In the variable names, replace all of the spaces with `_` characters.
  3. Update the values in the Geographic area variable to remove excess information.
  4. Drop the variables Recent Case Count and Recent Case Rate.
  5. Remove the prefix word "Cumulative" from any variable names.
  6. Rename the Geographic area variable to public_health_unit.
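
The six wrangling steps above can be sketched as a single pipeline. Everything here is hypothetical - the real file's column names and values may differ - so treat this as a pattern rather than the exact solution (assumes the tidyverse is installed):

```r
library(tidyverse)

# Hypothetical stand-in for the real csv (column names are assumptions)
phu_information.df <- tibble(
  `Geographic area`        = c("Ontario", "Toronto Public Health (3895)"),
  `Cumulative Case Count`  = c(100, 50),
  `Recent Case Count`      = c(5, 2),
  `Recent Case Rate`       = c(1.1, 0.9)
) %>%
  rename_with(str_to_lower) %>%                           # 1. lower-case variable names
  rename_with(~ str_replace_all(.x, " ", "_")) %>%        # 2. spaces -> underscores
  mutate(geographic_area =
           str_remove(geographic_area, " \\(\\d+\\)$")) %>%  # 3. trim excess info
  select(-recent_case_count, -recent_case_rate) %>%       # 4. drop the "recent" variables
  rename_with(~ str_remove(.x, "^cumulative_")) %>%       # 5. strip the "cumulative" prefix
  rename(public_health_unit = geographic_area)            # 6. final rename
```

The same pattern works regardless of how many "cumulative" or "recent" columns the real file contains.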

We're now ready to start visualizing our data, but how will we go about doing it?

1.3.0 Which aspects of the data do we wish to highlight?

Let's take a look at the following chart from Dr. Andrew V. Abela.

Abela.ChartChooser.png
From the flowchart above we'll mainly explore looking at relationships and composition using our dataset as a basis for our visualizations. We'll cover distributions and comparison with a more complex dataset later in this lecture.

2.0.0 Visualizations help us identify relationships and correlations

Depending on the nature of your data and its dimensions there are a number of plots to choose from to represent and convey relationships between your variables. These various plots can reveal trends, correlations, and ultimately relationships between variables. As an initial form of graphical assessment, scatterplots build a framework for exploring our data further.

| Relationship description | Types of charts |
|---|---|
| Finding correlations in your data | Scatterplot, Bubble chart, Lineplot, Heatmap |
| Showing connections between groups | Arc diagram, Chord diagram, Connection map, Network diagram, Non-ribbon chord diagram, Tree diagram |
| Showing relationships and connections between the data, or showing correlations between two or more variables | Heatmap, Marimekko chart, Parallel coordinates plot, Radar chart, Venn diagram |

In our first lecture, we covered the creation of lineplots. For today, we'll focus on just two relational plots: the scatterplot and its variant, the bubblechart. Parallel coordinate plots will also make an appearance later on during this lecture.


2.1.0 Scatterplots and bubblecharts are geom_point() graphs

When we try to examine relationships, one direction we can take is to look for trends by comparing one variable against another. From an experimental standpoint we can graph our independent variable on the x-axis, looking for changes in our dependent variable/measurement on the y-axis. This is most easily accomplished when both of your variables are on a continuous scale.

When trying to show correlations in your data between two to three variables you can use scatterplots. Remember there are some limitations when visualizing multi-dimensional data on a two-dimensional canvas. Overall these plots can convey to your audience the potential correlations between your variables or separations between groups based on their colour or shape.

2.1.1 Use the alpha parameter to help visualize overlapping data points

Let's begin with a scatterplot comparing the number of cases versus hospitalizations. We'll use the observations from each PHU as separate datapoints. Of note, within the geom_point() layer, we'll be updating the alpha parameter to change the transparency of our points.

Using the "alpha" parameter: Note that `alpha = 1` will make completely opaque points while `alpha = 0` will make completely transparent points. As points begin to overlap, they will create increasingly opaque areas in your visualization.

2.1.2 The bubbleplot adds an obvious dimensionality to your data

The bubbleplot, as we'll see, brings an additional dimension to your visualization by varying the size of your points based on a (usually) continuous variable. This can help to highlight underlying data trends in a more obvious fashion for your audience.

To achieve this we will set the size parameter in our aesthetics aes() layer. By setting the size parameter to a variable in our data, it will alter the size of our data points. We will augment the results of this by providing a size range through the scale_size() layer.
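A sketch of the size mapping, again on hypothetical stand-in data:

```r
library(tidyverse)

# Hypothetical data: population drives the point (bubble) size
phu <- tibble(
  case_count         = c(1000, 5000, 20000, 800),
  hospitalized_count = c(30, 150, 700, 25),
  population         = c(150000, 550000, 2950000, 90000)
)

p <- ggplot(phu, aes(x = case_count, y = hospitalized_count, size = population)) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(1, 12))  # smallest/largest point size (in mm)
```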


2.2.0 Scale your axes to help see your data better

From above we can see a lot of values are bunched up together at the lower end of our scales and likewise we have a few values that are far out on our x/y axes. To even this out, we can transform our axes to display values on a log10 scale. We have two places where we can accomplish this:

  1. Transform your data directly ie log(case_count). Data will be placed on a log value scale but may have less meaning to the lay person.
  2. Transform your scales/tick marks. The data is untouched and the values still have meaning to your audience without having to do conversions in their head.
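
Option 2 can be sketched as follows; the data stay untouched and only the tick marks are transformed (toy data, assumed column names):

```r
library(tidyverse)

phu <- tibble(
  case_count         = c(100, 1000, 10000, 100000),
  hospitalized_count = c(4, 35, 320, 2900)
)

p <- ggplot(phu, aes(x = case_count, y = hospitalized_count)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +  # axes display on a log10 scale...
  scale_y_log10()    # ...while the underlying values keep their meaning
```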

From our bubbleplot it is clear that rising case counts are tied to increasing population size. The relationship between our log-transformed data appears to be linear.

How do you interpret log-transformed data?: After log-transforming both axes it looks like the relationship between overall cases and hospitalizations is linear. While the transformation has definitely improved the visualization, it has made the interpretation a little harder. Overall, in this situation, you'll either want to model against the untransformed data or do the math correctly and recognize that the relationship is a power function:
\begin{equation*} \begin{aligned} \log y &= B + a \log x &&\text{where } B \text{ is the intercept of the vertical axis and } a \text{ is the slope}\\[10pt] y &= 10^{(B + a \log x)}\\[10pt] y &= 10^{B}\, 10^{a \log x}\\[10pt] y &= 10^{B} x^{a} &&\text{recall that } 10^{a \log x} = \left(10^{\log x}\right)^{a} = x^{a}\\[10pt] \end{aligned} \end{equation*}

3.0.0 Distribution plots display frequency and spread across groups or intervals

Many students are familiar with the classic bar chart. Thus far, we've already used it to some extent to look at our last lecture's PHU case data. When comparing groups or populations for differences in their distribution, however, the bar chart falls quite short of the ideal.

When comparing distributions, we again are often concerned with summary statistics - mean, median, standard deviation, confidence intervals. The nature of our data, however, is often not continuous across an entire y-axis interval as a bar chart may suggest but rather our population is the result of values centred, with some variance, around a mean value.

Over the remainder of the lecture we will compare and contrast 4 types of distribution plots for their relevance and when they may be most useful to visualize our data. When applicable to our data visualizations, we will also display all of our data points to better illustrate our distribution versus the visualization. We will examine:

  1. Bar charts or Barplots
  2. Density plots
  3. Box and Whisker plots
  4. Violin plots

3.1.0 The barplot uses the geom_bar() or geom_col() to display data

As we have already seen, the geom_bar() layer can be used to display our data in multiple ways. The height of the bar is proportional to either the number of observations in each group, or, by applying a weight aesthetic, to the sum of the weights. In our previous examples we actually used the value of our observations to determine the height and proportions of our groups by setting the parameter stat="identity". A simpler way to accomplish this is with geom_col(), which uses stat_identity() by default.

| geom | Description | Requirements |
|---|---|---|
| geom_bar() | Counts observations in your data and (by default) sets bar height proportional to those counts | Only accepts an x OR y parameter in aes() |
| geom_col() | Uses y-axis values as the height of each bar | Requires both x AND y parameters in aes() |

Both of these plots use position="stack" by default, with bar heights matching the observations or sums for multiple values sharing the same x position. Such instances can instead be displayed side by side using position="dodge" or position="dodge2". This is only helpful, however, when the number of x values (ie categories) is low; otherwise the graph becomes crowded.

Let's begin with a simple barplot of SARS-CoV-2-related deaths accumulated over the course of the pandemic. Using phu_information.df as a data source we will convey the distribution of total deaths per PHU using the geom_col() layer.

Recall the first row of our data is "Ontario", representing the total values for each variable as a sum of the other PHUs. We'll remove that from our data using slice() before passing it on to ggplot().
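
A sketch of the slice-then-plot step, with a hypothetical stand-in for phu_information.df (the real table has 34 PHUs and different values):

```r
library(tidyverse)

# Hypothetical stand-in; row 1 is the province-wide "Ontario" summary
phu_information.df <- tibble(
  public_health_unit = c("Ontario", "Toronto", "Peel", "Waterloo"),
  deaths_count       = c(4800, 3500, 1000, 300)
)

p <- phu_information.df %>%
  slice(-1) %>%  # drop the "Ontario" summary row before plotting
  ggplot(aes(x = public_health_unit, y = deaths_count)) +
  geom_col()
```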


3.1.1 Use reorder() to sort your data as you plot it

In the case of phu_information.df we can do a little better by ordering our x-axis of PHUs by total population or death counts. Since our data table is quite simple, with only a single observation for each PHU, there are a number of ways to accomplish a sort:

  1. Convert population to a factor and use fct_reorder() to sort our factors.
  2. Sort the entire table using arrange() before plotting.
  3. Use the reorder() function while generating the plot.

Let's sort our data on the fly in our next example using the R base function reorder().
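
A sketch of option 3, again with made-up values in place of the real table:

```r
library(tidyverse)

phu <- tibble(  # hypothetical columns and values
  public_health_unit = c("Toronto", "Peel", "Waterloo", "Ottawa"),
  deaths_count       = c(3500, 1000, 300, 600),
  population         = c(2950000, 1450000, 550000, 1000000)
)

# reorder() sorts the x-axis categories by population as we plot
p <- ggplot(phu, aes(x = reorder(public_health_unit, population),
                     y = deaths_count)) +
  geom_col() +
  labs(x = "public health unit (sorted by population)")
```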


3.2.0 Barplots can help guide the eye

From above, using the barplots we can quickly gauge the difference between PHUs as they get sorted by population. There's not much need to colour by population as well, but it does add some emphasis to Toronto as a larger population and further reinforces the idea that we have sorted by population size.

What are we missing from this visualization that would help the reader? It would be nice to know just how much variation there is in population size, especially in the first ~12 PHUs. How much bigger is Peel versus the Region of Waterloo?

3.2.1 Use the fill parameter to distinguish between groups

One simple way to enhance our barplot is through the use of fill colour. By setting the fill parameter to population we can shade each bar based on the population size of each PHU. Let's take a look.


3.2.2 Add extra geom_* layers to more accurately reflect values

As you can see from above, unless you have a very good eye for shades, the fill colour doesn't do much to help us distinguish between populations. We can still see that Toronto has close to 3M residents, but beyond that we again have trouble discriminating between populations.

Instead of fill colour, we can add a little detail by layering an additional geom. In this case, we will use geom_point() to add the population of each PHU after scaling it down to 1/1000th.

Why scale population by 1/1000th? Why did we choose to scale the population size of each PHU to 1:1000? This is a decision based on what we know about the data already. We already know our y-axis deaths_count data scales from 0 to 4000. Knowing the populations can scale from 0 to 3M, it makes sense to try and get our values along the same scale.
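
The layering described above can be sketched like this (hypothetical data; the scaled population points land in the same 0-4000 range as the death counts):

```r
library(tidyverse)

phu <- tibble(  # hypothetical columns and values
  public_health_unit = c("Toronto", "Peel", "Waterloo"),
  deaths_count       = c(3500, 1000, 300),
  population         = c(2950000, 1450000, 550000)
)

p <- ggplot(phu, aes(x = reorder(public_health_unit, population))) +
  geom_col(aes(y = deaths_count)) +
  geom_point(aes(y = population / 1000), colour = "red")  # population at 1:1000
```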

3.2.3 The lollipop plot: a sweet twist on the barplot

Now we have more clarity on populations sizes between PHUs, answering our original question that Peel region has nearly one million more inhabitants than the Region of Waterloo.

On this same topic, let's visit one last plot that can give us some pieces from the two variants of barplots we've used. The lollipop graph clarifies that x-axis values are more singular in nature rather than spanning a range, while still visually connecting those y-axis values to their x-axis categories.

It looks very much like it sounds, and we'll add an extra twist to ours by setting the point size to the population size, giving it just a bit more informational dimension. To accomplish this visualization we'll combine a geom_point() with a geom_segment().

Aesthetics when working with multiple geoms: As you'll see in the following code, we've made some adjustments due to working with multiple geom_* layers. Since there can be overlapping aesthetics between geoms, you need to be cognizant of their effects. Rather than set these parameters in the aes() layer, set them directly in their respective geoms.
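
A sketch of the geom_segment() + geom_point() combination (toy data; note the aesthetics specific to each geom are set inside that geom, as described above):

```r
library(tidyverse)

phu <- tibble(  # hypothetical columns and values
  public_health_unit = c("Toronto", "Peel", "Waterloo"),
  deaths_count       = c(3500, 1000, 300),
  population         = c(2950000, 1450000, 550000)
)

p <- ggplot(phu, aes(x = public_health_unit, y = deaths_count)) +
  geom_segment(aes(xend = public_health_unit, yend = 0)) +  # the "stick"
  geom_point(aes(size = population)) +                      # the "candy"
  scale_size(range = c(2, 8))
```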

3.3.0 Barplots to convey proportion or composition

Let's revisit our public health unit data from last lecture. We'll use an updated long-format version that we created and saved in a csv format. As you might recall, this dataset contains a column of dates with each row representing an observation of new_cases reported by a public_health_unit on that date. We have a 4th column total_phu_new representing total cases reported across all PHUs on that specific date.

| date | public_health_unit | new_cases | total_phu_new |
|---|---|---|---|
| Date format: YYYY-MM-DD | factor of 34 PHUs | numeric:double | numeric:double |
| ... | ... | ... | ... |

In our first lecture we looked at the other helpful visualization that barplots can produce: composition and proportions. Here we can combine the proportions of categories of data to help convey an added dimension to our data. By stacking PHUs in our "new case" data we are now plotting new cases per month for multiple PHUs. The area of each stack, however, also gives us a sense of proportions for each PHU that we've added.


3.4.0 Convert your plots to a circular layout with coord_polar()

Generally speaking there are a number of circular or polar coordinate plot variants ranging in complexity from the pie chart to a racetrack chart. The coord_polar() layer takes a few parameters, most notably theta (whether the "x" or "y" aesthetic is mapped to the angle), along with start and direction.

Here's a summary:

| Chart name | Plot description | Uses | Cartesian analog |
|---|---|---|---|
| Pie chart | Sliced-up proportions of a circle | Simple proportion comparison between groups | Single bar variant |
| Nightingale/Coxcomb plot | Intervals are equal wedges but shaded areas represent proportions | Intervals with additional categorical proportions | Stacked column plot, theta="x" |
| Race track plot | Intervals are split concentrically to form rings of equal width | Helpful if some groups are less diverse | Stacked column plot, theta="y" |

3.4.1 Pie charts work best with simple data

From our pie chart description and from your own experience, you can recall that pie charts aren't helpful with complex data. Once the number of categories surpasses 6-8 groups, things can get hard to visually digest - especially if one category dominates the others.

Let's use phu_information.df to assess the cumulative COVID-19 case data from across our PHUs to look at the top 4 PHUs by caseload. We'll convert a geom_bar() barplot to our pie chart using the coord_polar() layer.
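
The conversion can be sketched as follows; the PHU names and counts below are made up for illustration:

```r
library(tidyverse)

top4 <- tibble(  # hypothetical top 4 PHUs by caseload
  public_health_unit = c("Toronto", "Peel", "York", "Ottawa"),
  case_count         = c(170000, 110000, 55000, 30000)
)

p <- ggplot(top4, aes(x = "", y = case_count, fill = public_health_unit)) +
  geom_bar(stat = "identity", width = 1) +  # one stacked bar...
  coord_polar(theta = "y")                  # ...wrapped into a circle
```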


3.4.2 Nightingale rose charts emphasize proportions

Made famous by Florence Nightingale, these plots (which she named coxcomb plots) were used to help emphasize the proportions of soldier deaths due to infection during the Crimean war. Each circular increment is equally spaced to represent increasing values along the y-axis. Visually, this gives more area representation as we radiate outwards on the chart. This produces a chart that

  1. makes distinguishing larger proportions more obvious.
  2. adds a new dimension to the pie chart, allowing us to create intervals for comparison.
  3. helps connect our start and end category, especially if they don't represent an ordinal or continuous scale.

Caution: outer segments disproportionately represent increases in value. Their larger area produces more visual emphasis despite the linear scale of the y-axis.

We'll switch back to our PHU daily case data from covid_phu.df to generate some stackable information.


3.4.3 Racetrack plots are an aesthetic variant of the bar chart

Want to make another cool chart from a boring bar chart? Stacked or otherwise, you can convert those bar plots to a racetrack or radial bar chart. The only difference between the coxcomb plot and the radial bar chart is that instead of the default theta="x" we set it to "y" - much like flipping the coordinates of a bar chart.

Beware the racetrack transformation: This is an aesthetic transformation where each bar gets relatively longer as you move from the inside out. Values must therefore be judged by the angle swept, not by arc length! Our eyes can more precisely judge differences on a Cartesian bar chart, so when viewing these types of charts, don't be misled by how they look at a casual glance!

Let's make a set of barchart data and compare it to the racetrack plot. Note that negative values in our dataset are plotted separately. It is an implementation quirk of ggplot.
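
A minimal sketch of the barplot-to-racetrack conversion, on a made-up set of monthly totals:

```r
library(tidyverse)

monthly <- tibble(  # hypothetical monthly case totals
  month = factor(c("Dec", "Jan", "Feb"), levels = c("Dec", "Jan", "Feb")),
  cases = c(40000, 65000, 30000)
)

p <- ggplot(monthly, aes(x = month, y = cases, fill = month)) +
  geom_col(width = 0.6) +
  coord_polar(theta = "y")  # theta = "y": bars become concentric arcs
```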


3.4.4 Barplots and their variants summarize the distribution of data between groups or intervals but NOT within single populations.

Often for our purposes as scientists we are more concerned with the distribution of replicates or measurements as a population within a group whilst also comparing across other groups (ie control vs treatment). Naively, we are tempted by the convenience of Excel to simplify our data as a bar chart with some simple standard deviation or error bars. One word concisely describes this behaviour:

$$\text{Inappropriate}$$

Note from our examples that although we are comparing public health units, our datapoints represent single or total observations separated either by group or by time. At each x-axis point we are not comparing the distribution between categories in a meaningful way. In fact, from our first example, we can convey a near-exact message using a dot-plot! In essence these visualizations are communicating the categorization of samples across multiple groups. More or less, this is about visualizing proportions.

Section 3.0.0 Comprehension Question: After exploring the racetrack variant above, visualizing our data ranging from December 2020 to present, what is the biggest issue you might see? What would be one way to remedy this? Is this an appropriate visualization of our data?

4.0.0 Density plots compute a theoretical distribution

What is the shape of our data? When we encounter experimental populations and begin sampling or measuring for specific characteristics, we inevitably begin to reveal whether or not that characteristic has a predictable range, median value, etc. Density plots, once given enough sample values, begin to predict the shape or distribution of that sampling. We sometimes use this to represent the theoretical distribution of our original populations.

4.0.1 Distribution plots need appropriate data

In contrast to our previous datasets, we are interested in dissecting group characteristics after sampling multiple times. Therefore, before continuing on our journey, we need to find some data that is better analysed as population data. Let's open up some more SARS-CoV-2 data from Ontario PHUs that has been broken into age and sex demographics. Before we can do any visualization we'll want to tidy up the data in the following steps:

  1. Import the data with read_csv()
  2. Fix the column names to remove spaces and replace them with "_"
  3. Pare down on the columns we want to look at. In this case we'll be focusing only on three indicators: total cases, deaths, and hospitalizations regardless of sex.
  4. Summarize the data into proportion of cases for each age group, based on the specific total for each indicator within each PHU.

After all of our transformations, we'll have specific proportions of infections, deaths, and hospitalizations within each PHU for each age group.


4.1.0 Density and histogram plots model distribution of samples or individuals within a population

While bar plots help to focus on proportional representation for categorical data, both density plots and histograms can be used to convey the frequency or distribution of values for a variable across its specified range. When comparing distributions visualized this way, you can usually compare up to 3 or 4 on the same plot before the data becomes too crowded. These methods also give you a sense of your data before moving forward with more detailed analyses. You can also identify possible mistakes or artefacts in data collection.

How much data do I need? Density plots can be thought of as smoothed versions of histograms which have been binned in small intervals. Density plots of a single dimension require a minimum of 4 samples but justifying a KDE on a sample size that small is hard. Histograms are recommended to have at least 30 observations to be considered useful and I would apply this rule of thumb to KDEs as well.

Let's pick a few age groups to plot based on percent_cases. We'll use data gathered from each PHU to see if the same general trends (if any) apply regardless of PHU.
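
A sketch of the overlaid KDEs. The real data comes from the tidied PHU file, so the tibble below is a purely hypothetical stand-in (simulated percent_cases for 34 PHUs per age group):

```r
library(tidyverse)
set.seed(1)

# Hypothetical stand-in: percent of cases per age group across 34 PHUs
age_phu <- tibble(
  age_group     = rep(c("0-19", "20-39", "40-59", "60+"), each = 34),
  percent_cases = c(rnorm(34, 15, 3), rnorm(34, 35, 4),
                    rnorm(34, 28, 3), rnorm(34, 22, 3))
)

p <- ggplot(age_phu, aes(x = percent_cases, fill = age_group)) +
  geom_density(alpha = 0.4)  # one KDE per age group, semi-transparent
```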


4.1.1 Apply a facet_*() to view KDEs separately

As you can see from above, as we start to have more and more age groups, it may be better to separate them out in order to avoid too much overlap. It may also be to our advantage to compare them in a more vertical fashion. Recall that there are two layers we can choose from: facet_wrap() and facet_grid(). They are differentiated by the following characteristics:

  1. facet_wrap() lays out a one-dimensional sequence of panels and wraps them into rows and columns (controlled with the nrow/ncol parameters).
  2. facet_grid() forms a two-dimensional matrix of panels defined by row and column faceting variables.

Let's use facet_wrap() to generate a single-column facet of KDEs.
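
A sketch of the single-column facet, again on simulated stand-in data:

```r
library(tidyverse)
set.seed(1)

age_phu <- tibble(  # hypothetical percent_cases per age group across 34 PHUs
  age_group     = rep(c("0-19", "20-39", "40-59", "60+"), each = 34),
  percent_cases = c(rnorm(34, 15, 3), rnorm(34, 35, 4),
                    rnorm(34, 28, 3), rnorm(34, 22, 3))
)

p <- ggplot(age_phu, aes(x = percent_cases, fill = age_group)) +
  geom_density() +
  facet_wrap(~ age_group, ncol = 1)  # one panel per age group, single column
```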


4.2.0 Plot multiple distributions with a ridgeline plot

Ridgeline plots (sometimes called Joyplots) can generate a compact way to show multiple distributions of numeric values across several groups. The distributions can be shown as either histograms or density plots with all of them aligned to the same horizontal axis. The vertical axis is compressed slightly to generate an overlap between categories.

To generate these visualizations we can use the ggridges package which is an extension of ggplot2. In this case, that means it uses the same grammar and can be added as a layer call to geom_density_ridges(). A parameter to keep in mind is scale, which controls how tall each density is relative to the spacing between groups - and therefore how much the ridges overlap.

Let's begin by replicating our facet plot from above.
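
A sketch with geom_density_ridges(), using simulated stand-in data (assumes the ggridges package is installed):

```r
library(tidyverse)
library(ggridges)
set.seed(1)

age_phu <- tibble(  # hypothetical percent_cases per age group across 34 PHUs
  age_group     = rep(c("0-19", "20-39", "40-59", "60+"), each = 34),
  percent_cases = c(rnorm(34, 15, 3), rnorm(34, 35, 4),
                    rnorm(34, 28, 3), rnorm(34, 22, 3))
)

p <- ggplot(age_phu, aes(x = percent_cases, y = age_group)) +
  geom_density_ridges(scale = 1.2)  # scale > 1 lets the ridges overlap slightly
```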


4.3.0 Use geom_density_ridges_gradient() to fill densities on a gradient

From above you can now see that we've somewhat compacted all that dimensional data into a single plot that still clearly conveys the difference in overall proportions for total infections within each age group. The distributions across our categories suggest that the 20-39 age group makes up a larger proportion of overall cases in each PHU. For our audience, we would need to clean up our axis titles to clarify that these proportions are calculated independently.

An additional option with your ridgeline plots is the fill variant. To accomplish a nicer gradient we will include a call to scale_fill_viridis_c() since our x-axis is continuous. Keep in mind that we cannot set the alpha transparency on our density plots when filling with a gradient. We also have to set our aesthetics fill to stat(x) to accomplish this feat.
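
The gradient fill can be sketched as follows (simulated stand-in data; assumes ggridges is installed):

```r
library(tidyverse)
library(ggridges)
set.seed(1)

age_phu <- tibble(  # hypothetical percent_cases per age group across 34 PHUs
  age_group     = rep(c("0-19", "20-39", "40-59", "60+"), each = 34),
  percent_cases = c(rnorm(34, 15, 3), rnorm(34, 35, 4),
                    rnorm(34, 28, 3), rnorm(34, 22, 3))
)

# fill = stat(x) shades each ridge along the continuous x-axis values
p <- ggplot(age_phu, aes(x = percent_cases, y = age_group, fill = stat(x))) +
  geom_density_ridges_gradient() +
  scale_fill_viridis_c(name = "percent of cases")
```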


5.0.0 Categorical distribution plots

Did you know the boxplot is nearly 50 years old? Invented in the 1970s by our favourite statistician, John Wilder Tukey, this iconic plot deserves a closer look at how and when to use it. While we're here we'll also take a look at other categorical distribution plots. While our KDE and ridgeline plots provide quite a bit of detail, they can also be a little more limited in their space efficiency. The following categorical distribution plots will perhaps provide some more information efficiency.

5.1.0 Summarize population distributions with geom_boxplot()

As you can see from the previous section, we comfortably fit quite a few distributions on a ridgeline plot. From the looks of it, the 20-39 age group makes up a higher percent of cases across all the PHUs. Previously this data was broken down by 10-year groupings, but it has since been amended to make larger age groups in the 20+ range. Still, we can explore this data a little closer.

Can we visualize the data in a more summarized form? Let's explore the box plot.

boxplot_dissection.png
The dissection of a boxplot's components shows us how it summarizes data distribution.

Also known as the box and whisker plot, this visualization conveys the distribution of samples within a group or population and is built upon 5 principal values:

  1. The minimum (the lowest observation within 1.5 IQR of the lower quartile)
  2. The lower quartile (25th percentile)
  3. The median (50th percentile)
  4. The upper quartile (75th percentile)
  5. The maximum (the highest observation within 1.5 IQR of the upper quartile)

Together, the lower and upper quartiles produce the interquartile range (IQR). The general implementation of boxplots classifies any observations more than 1.5 IQR above the upper quartile or below the lower quartile as outliers of the distribution. The characteristics of outliers can be set as parameters within the geom_boxplot() layer. Parameters include outlier.shape, outlier.size, and outlier.colour.

Unlike a histogram, the minimum number of values to generate a boxplot is 5. While you could generate a boxplot on fewer values, you might not have actual whiskers! This is definitely a great alternative when sample sizes are between 5-30 for each population.

Historically this was a simple way to visualize summary statistics of a population while being easy to produce by hand. Of course, with the age of computing, the production of kernel density estimates has allowed for more diverse visualizations. This plot, however, remains a popular format and thus is more readily understood by general audiences.

Each compact box can take up the same space as a barplot column but it gives much more information about the population. Let's look at a single aspect - percent_cases.
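
A sketch of the boxplot on simulated stand-in data, including a couple of the outlier parameters mentioned above:

```r
library(tidyverse)
set.seed(1)

age_phu <- tibble(  # hypothetical percent_cases per age group across 34 PHUs
  age_group     = rep(c("0-19", "20-39", "40-59", "60+"), each = 34),
  percent_cases = c(rnorm(34, 15, 3), rnorm(34, 35, 4),
                    rnorm(34, 28, 3), rnorm(34, 22, 3))
)

p <- ggplot(age_phu, aes(x = age_group, y = percent_cases)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1)  # flag outliers in red
```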


5.2.0 Upgrade your boxplot with some confidence intervals and values

From the looks of it, we can confirm what we saw before in our density plot - that the 20-39 age group generates the highest percentage of cases across PHUs and that with increasing age, the number of reported cases decreases as a proportion of the total.

We can add a few extra items to the plot to help us visualize the data:
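As a sketch of the kinds of extras we can layer on (notch = TRUE draws a notch approximating a 95% confidence interval around the median, stat_summary() overlays each group's mean, and geom_jitter() shows the raw values; covid_data is a placeholder name for this lesson's dataset):

```r
library(ggplot2)

ggplot(covid_data, aes(x = age_group, y = percent_cases)) +
  # the notch approximates a 95% confidence interval around the median
  geom_boxplot(notch = TRUE, outlier.shape = NA) +
  # overlay each group's mean as a white diamond
  stat_summary(fun = mean, geom = "point", shape = 23, fill = "white") +
  # show the underlying observations
  geom_jitter(width = 0.2, alpha = 0.4)
```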


5.3.0 geom_beeswarm() takes plotting your points to the next level

As you can see from above, the geom_jitter() layer adds points to our boxplot, applying random noise to each point's position so that they overlap as little as possible. Points are restricted to the width of the boxplot, although this can be adjusted to some degree with the right parameters. geom_jitter() is native to the ggplot2 package, with parameters that allow for a more "random" distribution of your data points within a provided area.

The goal of the ggbeeswarm package is to generate points that do not overlap while simultaneously suggesting the kernel density of your data. Two geoms that work with the ggplot2 package are supplied to accomplish this:

  1. geom_beeswarm() - offsets overlapping points into a compact "swarm".
  2. geom_quasirandom() - spreads points in a quasirandom layout that traces the shape of the distribution.

geom_beeswarm() has a number of parameters that can be used to set its aes() mappings but also how the points are laid out.
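A sketch of a beeswarm layered over a boxplot might look like this (covid_data is a placeholder name; cex and priority are real geom_beeswarm() parameters):

```r
library(ggplot2)
library(ggbeeswarm)

# cex scales the spacing between points; priority sets the order in which
# points are packed into the swarm ("density" packs densest regions first).
ggplot(covid_data, aes(x = age_group, y = percent_cases)) +
  geom_boxplot(outlier.shape = NA) +   # the swarm already shows every point
  geom_beeswarm(cex = 1.5, priority = "density", alpha = 0.6)
```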


5.3.1 geom_quasirandom() adds KDE structure to your plotting

As we can see from above, the geom_beeswarm() layer adds a little more structure to the data, in a somewhat rigid way. Any datapoints that are near each other are deliberately spaced apart, roughly tracing the distribution of your data. Of course, you may run into issues as your number of datapoints increases or as your data range widens (see the 20 to 39 age group).

To remedy this, we can balance the visualization a bit with the geom_quasirandom() function. geom_quasirandom() works similarly to the beeswarm function, with an additional method parameter controlling how the points are plotted.
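A minimal sketch (covid_data is a placeholder name for this lesson's dataset):

```r
library(ggplot2)
library(ggbeeswarm)

# method selects the layout algorithm ("quasirandom" is the default;
# "pseudorandom", "smiley", and "frowney" are alternatives), while
# width caps how far points spread from the group's centre line.
ggplot(covid_data, aes(x = age_group, y = percent_cases)) +
  geom_quasirandom(method = "quasirandom", width = 0.3, alpha = 0.6)
```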


Now our data points are laid out in a more nuanced way, with a uniform width that shapes the data as a distribution. We'll see this with even more emphasis in a few sections.

beeswarm_joy.png
When you need to show a data distribution, try the quasirandom plotting of points over a simple beeswarm.

5.4.0 Add a third dimension to your box plot with fill

Is there more to the data we've visualized? We can add a third dimension in a number of ways, but the simplest would be to compare the proportions of total cases vs. total deaths over the course of this pandemic. To do so, we can pivot our dataset to collapse percent_cases and percent_deaths together into a single set of variables.

From there, we'll use the fill parameter to generate different subgroups in our boxplot, making a grouped boxplot. In doing so, we'll also have to use the geom_quasirandom() parameter dodge.width, which splits the data points by any aesthetic groups that have been assigned.
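The pivot-and-dodge steps might be sketched as follows (covid_data, stat_group, and percent are placeholder names for this lesson's dataset and the pivoted columns):

```r
library(tidyverse)
library(ggbeeswarm)

# Collapse the two percentage columns into one stat_group/percent pair.
covid_long <- covid_data %>%
  pivot_longer(cols = c(percent_cases, percent_deaths),
               names_to = "stat_group", values_to = "percent")

# fill creates the grouped boxes; dodge.width splits the points to match.
ggplot(covid_long, aes(x = age_group, y = percent, fill = stat_group)) +
  geom_boxplot(outlier.shape = NA) +
  geom_quasirandom(dodge.width = 0.75, alpha = 0.5)
```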


In our boxplots, we are plotting the data in both boxplot and dotplot formats. The shape of the dots helps give a better sense of the PHU distribution within each age group. You can see that the data points overlay the box but also fall outside it. Can we get the compact nature of the boxplot while keeping the visual appeal of a density plot?


5.5.0 Violin plot - the lovechild of density and boxplots

As the title says, the violin plot is a mixture of the boxplot and the kernel density plot. It's a miniaturized KDE that is mirrored across its axis, encompassing the summary information of the boxplot in a pear- or violin-shaped distribution. To generate a geom_violin() in ggplot, a minimum of three values is required. To justify using a violin, I would again suggest sticking to a similar rule of thumb of roughly 30 samples/observations at minimum to ensure an accurate representation of the distribution.

The nuanced visualization of a violin plot gives much more information than the box plot itself, and most boxplots can be replaced with a violin plot. Despite being a gateway to more detailed distribution information, this format remains less popular and less familiar to scientists, so its immediate accessibility to your audience can be limited.

An important parameter of this geom is scale. By default (scale = "area") all violins have the same area; scale = "count" sizes areas proportionally to the number of observations, and scale = "width" gives all violins the same maximum width.

Outliers and violins: Much like a KDE, the theoretical distribution of a violin plot can generate some impossible values - especially at the tails. Remember that this is a theoretical distribution based on the data supplied. Depending on how much variation is in your data, and how many outliers it has, the shape of your violin can be strongly affected.
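A minimal violin sketch (covid_data is a placeholder name for this lesson's dataset):

```r
library(ggplot2)

# scale = "width" gives every violin the same maximum width, while
# trim = FALSE lets the density tails extend beyond the observed range -
# exactly where impossible "ghost" values can appear.
ggplot(covid_data, aes(x = age_group, y = percent_cases)) +
  geom_violin(scale = "width", trim = FALSE)
```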

5.6.0 Violin plots represent distributions and boxplots summarize them

The major advantage of the violin plot is that, by its nature, it is very sensitive to the distribution that produces the density estimate. The boxplot represents the summary information of a distribution, but it always renders as if the data were a simple, unimodal distribution - there are not enough parameters supplied to represent anything more complex!

The violin plot is not limited in that respect. Despite some of its visual caveats, it can certainly reveal multi-modal data. Let's make a toy example to illustrate.

Note that in the geom_quasirandom() layer, we'll employ the groupOnX parameter to set grouping by the y-axis values (FALSE) versus the x-axis (TRUE).
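Such a toy example might be sketched like this:

```r
library(ggplot2)
library(ggbeeswarm)

set.seed(42)

# Two normal populations glued together: a boxplot renders a single
# unremarkable box, while the violin clearly shows both modes.
toy <- data.frame(
  group = "bimodal",
  value = c(rnorm(100, mean = 2), rnorm(100, mean = 8))
)

ggplot(toy, aes(x = group, y = value)) +
  geom_violin() +
  geom_boxplot(width = 0.1) +
  geom_quasirandom(groupOnX = TRUE, alpha = 0.3)
```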


5.7.0 Combine violin and boxplots into the ultimate plot

From our above example, you can see that we blended a number of geoms together. With a little work, we can also plot both violins and boxplots together in a multivariate setting! This gives us the familiarity of the boxplot while clearly displaying the theoretical distribution. Some steps to accomplish this:

  1. To put emphasis on the violin plots, we set the scale parameter to "width".
  2. We need to adjust some of the boxplot parameters to fit the boxes within the violin plots.
  3. Some adjustments to the jitter parameters for our geoms. This time we will directly use geom_point().
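The steps above might be sketched as follows (covid_long and stat_group are placeholder names for the pivoted dataset; note how some aesthetics are set per layer rather than globally):

```r
library(ggplot2)

ggplot(covid_long, aes(x = age_group, y = percent)) +
  # step 1: scale = "width" keeps every violin at the same maximum width
  geom_violin(aes(fill = stat_group), scale = "width",
              position = position_dodge(width = 0.9)) +
  # step 2: narrow boxes tucked inside the violins
  geom_boxplot(aes(group = interaction(age_group, stat_group)),
               width = 0.15, outlier.shape = NA,
               position = position_dodge(width = 0.9)) +
  # step 3: geom_point() with a jitter-dodge position instead of geom_jitter()
  geom_point(aes(group = stat_group),
             position = position_jitterdodge(jitter.width = 0.1,
                                             dodge.width = 0.9),
             alpha = 0.3)
```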
Section 5.7.0 Comprehension Question: Looking at the above code for our graph, why do you think we set the aes(fill/group = stat_group) aesthetic individually for some layers rather than directly in the aes() layer?

5.8.0 Parallel coordinate plots can help visualize multivariate data

While the above visualization shows the total distribution of our age groups with some clarity, the messiness of outliers has invaded some of the violin plots themselves. An experienced eye, however, can still see that the bulk of the population centres around our internal boxplots, offset by the stretched-out look of our violins. We could, of course, clean it up by removing some of the smaller PHU populations or removing outliers ahead of time.

Suppose, however, we wanted to add a third level to our factor, like percent_hospitalizations? With 7 age groups represented, things would start to get very crowded. To accommodate all of that data, you could facet it into 3 groups based on the statistic used, but this would separate the data, which looks sharper when you can see it all on a single plot.

Organizing multiple groups across multiple categories is the domain of the parallel coordinate plot. The GGally package is a ggplot2 extension whose ggparcoord() function allows us to look simultaneously at our 3 indicators (total cases, deaths, and hospitalizations) for each age group, linking each PHU. For each PHU at each indicator, we will draw a line connecting all values across the age groups (the coordinates), colouring the lines by indicator.

Although GGally is an extension for ggplot2, we actually have to wrestle with our data a little bit more and put it back into a wider format. Otherwise we can treat the plot similarly after making it by changing themes etc.

The parameters for ggparcoord() require:

  1. data - the data frame to plot (in wide format).
  2. columns - the columns to use as the parallel axes.
  3. groupColumn - the column used to group and colour the lines.

To generate this visualization, we'll want to:

  1. Remake our list of PHUs ordered by descending case load.
  2. Pivot our data to wide format, giving each age group its own column in our data.
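A sketch of these steps, assuming a long-format data frame covid_long with one row per PHU, indicator, and age group (all placeholder names for this lesson's data):

```r
library(tidyverse)
library(GGally)

# Spread the age groups into columns: one row per PHU per indicator.
covid_wide <- covid_long %>%
  pivot_wider(names_from = age_group, values_from = percent)

# columns = the age-group columns (the parallel axes);
# groupColumn = the variable used to colour each line.
ggparcoord(covid_wide,
           columns = 3:9,           # the seven age-group columns
           groupColumn = "indicator",
           scale = "globalminmax") +
  theme_minimal()
```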

From our above plot we see some very interesting points in our 60-79 and 80+ categories.

  1. These two groups have the highest share of mortality across all age groups.
  2. Conversely, these groups have lower shares of actual SARS-CoV-2 cases.
  3. The share of hospitalization rates in each PHU is also lower in our 80+ group.

Remember all of this data represents the proportion or share of each age group for each particular indicator. If, however, we really want to understand which age group has the worst hospitalization or death rate, we need to drill down into each age group.

Let's generate one last visualization and look at the probability of death after contracting SARS-CoV-2. We'll want to calculate those values and then graph them.

Note: We are working with cumulative values across the entire pandemic - this doesn't capture external factors such as vaccination rates or differences between variants!
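The calculation might be sketched as follows (column names are placeholders for this lesson's dataset):

```r
library(tidyverse)

# Cumulative case-fatality: deaths divided by cases within each age group.
cfr <- covid_data %>%
  group_by(age_group) %>%
  summarise(fatality_rate = sum(total_deaths) / sum(total_cases))

ggplot(cfr, aes(x = age_group, y = fatality_rate)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Age group", y = "Probability of death after a reported case")
```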


6.0.0 Class summary

We've covered a number of key plots today, including when and how to use them. Next week we'll revisit some of these plots and spruce them up with extra touches that will take them that extra distance. Below you'll find a summary of what we've discussed today.

6.1.0 Summary of plots

| Plot | Key Features | Notes |
| --- | --- | --- |
| Scatterplot | Good for exploring relationships between variables | Bubbleplots add an extra dimension to your data |
| Barplot | Present values across groups<br>Stack categories for an extra dimension | Presenting proportions, small sample sizes<br>Does not dissect individual distributions |
| Nightingale plot | Circular-wedge barplot, same properties as barplot | Presenting data over unordered groups<br>Visual emphasis on outer area size may mislead the reader |
| Racetrack plot | Circular-ringed barplot, same properties as barplot | Looking for a more compact way to show barplots<br>Calculate length by radians as outer rings are "stretched" |
| Density plot | Theoretical distribution of your sample data<br>Can plot up to 5 distributions on the same axis | Minimum sample size 4, but 30 is more reliable<br>Tails can produce "ghost" data |
| Ridgeline plot | Allows tighter visualization of multiple densities<br>Same properties as KDEs | Good way to pack more KDEs into a smaller area<br>No real control of outliers<br>Similar "ghost" data issues |
| Box and whisker plot | Summarizes distributions with 5 parameters | Popular and compact presentation of simple populations<br>Minimum sample size = 5<br>Does not properly visualize multi-modal data |
| Violin plot | Boxplot format with KDE violin shape<br>DOES represent multi-modal populations | Compact representation of distribution shape<br>Less popular; requires nuanced interpretation<br>Inherits "ghost" data and other properties of KDEs |
| Parallel coordinate plot | Visual representation of multivariate data<br>Related data can be connected linearly | Connects trends across groups<br>Not limited by number of samples<br>Can help identify trends within multicategorical data |

6.2.0 Weekly assignment

This week's assignment will be found under the current lecture folder under the "assignment" subfolder. It will include a Jupyter notebook that you will use to produce the code and answers for this week's assignment. Please provide answers in markdown or code cells that immediately follow each question section.

Assignment breakdown

| Component | Weight | Criteria |
| --- | --- | --- |
| Code | 50% | Does it follow best practices?<br>Does it make good use of available packages?<br>Was data prepared properly? |
| Answers and Output | 50% | Is output based on the correct dataset?<br>Are groupings appropriate?<br>Are titles/axes/legends correct?<br>Is the interpretation of the graphs correct? |

Since coding styles and solutions can differ, students are encouraged to use best practices. Well-coded or elegant solutions may be rewarded.

You can save and download the Jupyter notebook in its native format. Submit this file to the appropriate assignment section by 1:59 pm on the date of our next class: March 17th, 2022.


6.3.0 Acknowledgements

Revision 1.0.0: created and prepared for CSB1021H S LEC0141, 03-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.0.1: edited and prepared for CSB1020H S LEC0141, 03-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


6.4.0 References

  1. Public Health Ontario data: https://www.publichealthontario.ca/en/data-and-analysis/using-data/open-data

  2. Anscombe's Quartet: https://en.wikipedia.org/wiki/Anscombe%27s_quartet

  3. Choosing the right chart type: https://extremepresentation.typepad.com/blog/2006/09/choosing_a_good.html

  4. A layered grammar of graphics by Hadley Wickham: http://vita.had.co.nz/papers/layered-grammar.html

  5. read_delim() column types and information: https://readr.tidyverse.org/reference/read_delim.html

  6. Pie chart rules: https://www.1ka.si/d/en/help/faq/what-is-the-maximum-number-of-categories-to-display-using-pie-charts

  7. Kernel density estimation: https://stats.stackexchange.com/questions/76948/what-is-the-minimum-number-of-data-points-required-for-kernel-density-estimation

  8. Ridgeline plots: https://www.data-to-viz.com/graph/ridgeline.html

  9. Visualizing samples with boxplots: https://www.nature.com/articles/nmeth.2813.pdf?origin=ppub#:~:text=Whereas%20histograms%20require%20a%20sample,render%20it%20even%20more%20informative.

  10. Reasons to use a violin plot: https://blog.bioturing.com/2018/05/16/5-reasons-you-should-use-a-violin-graph/

  11. Parallel coordinate plot parameters: https://www.rdocumentation.org/packages/GGally/versions/1.5.0/topics/ggparcoord

  12. Lots more plots in R!: https://www.r-graph-gallery.com/index.html

CAGEF_new.png